Skip to content

fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258

Open
Zelys-DFKH wants to merge 2 commits into
strands-agents:mainfrom
Zelys-DFKH:fix/graph-cancel-node-graceful-skip
Open

fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258
Zelys-DFKH wants to merge 2 commits into
strands-agents:mainfrom
Zelys-DFKH:fix/graph-cancel-node-graceful-skip

Conversation

@Zelys-DFKH
Copy link
Copy Markdown
Contributor

@Zelys-DFKH Zelys-DFKH commented May 6, 2026

Description

When cancel_node is set in a BeforeNodeCallEvent hook, _execute_node raised RuntimeError after emitting MultiAgentNodeCancelEvent, which propagated through _execute_nodes_parallel and killed the graph. Downstream nodes never ran, the overall status became FAILED, and interrupt-and-resume workflows had no recovery path.

The fix introduces Status.SKIPPED. The bypassed node records result=None and status=Status.SKIPPED, lands in completed_nodes so downstream readiness checks are satisfied, and returns cleanly. Using Status.COMPLETED here would be wrong: a node that never ran is not completed, and conflating the two makes it impossible to distinguish "ran successfully" from "intentionally bypassed" in loop control and observability tools.

Three fixes came alongside:

  • _build_node_input filters SKIPPED deps before building dependency_results, so when all upstream deps are skipped the downstream node receives the original task with no orphaned "From node_x:" header
  • _from_dict reads the persisted NodeResult.status to restore Status.SKIPPED correctly on resume, rather than hardcoding Status.COMPLETED for everything in completed_nodes
  • GraphResult gains skipped_nodes: int; completed_nodes now counts only nodes that actually ran

Also opened #2401 for the complementary CANCELLED semantic (abort-branch, downstream does not run). That issue raises the cancel_nodeskip_node naming question; if maintainers prefer Option A, the rename can happen in this PR before it merges.

Related Issues

Resolves #2240

Documentation PR

N/A

Type of Change

Bug fix

Testing

Ran hatch run prepare. Updated test_graph_cancel_node (parametrized ×2) to assert Status.SKIPPED on both NodeResult.status and node.execution_status. Updated test_graph_cancel_node_downstream_executes to assert: downstream node executes (step_b.stream_async.assert_called_once()), step_b receives only the original task with no orphaned header, graph_result.skipped_nodes == 1, and graph_result.completed_nodes == 1.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions github-actions Bot added the size/s label May 6, 2026
@yonib05 yonib05 added area-multiagent Multi-agent related area-hooks Features or requests that might be implementable via hooks labels May 27, 2026
@yananym
Copy link
Copy Markdown
Contributor

yananym commented May 29, 2026

Thanks for working on this @Zelys-DFKH — this addresses a real pain point. I filed the original issue (#2240) and have been thinking about the design here, especially considering future graph looping support (revisiting previous nodes via edge conditions).

Proposal: Introduce two distinct statuses — SKIPPED and CANCELLED

After looking at how the TypeScript Strands SDK and LangGraph handle this, I think we should distinguish between two semantics:

1. SKIPPED — skip this node, continue downstream (what this PR does)

This is the "skip and continue" semantic. The node is intentionally bypassed, but downstream dependents should still execute. This is what my bug report asks for.

Design:

  • Introduce Status.SKIPPED enum value (not COMPLETED — a node that never ran shouldn't be marked as completed)
  • Dependency resolution treats SKIPPED the same as COMPLETED — the node counts as "satisfied" for downstream readiness checks in _find_newly_ready_nodes / _is_node_ready_for_resume
  • _build_node_input should skip the header entirely for SKIPPED deps (no empty "From node_x:" with nothing underneath)
  • reset_on_revisit clears SKIPPED nodes (same as COMPLETED) — they become eligible for execution again on loop re-entry
  • The NodeResult should not store a RuntimeError — a skipped node isn't an error. Use a lightweight marker (e.g., None result or a dedicated sentinel) so get_agent_results() cleanly returns []

The SKIPPED status is valuable because:

  • It's semantically honest — a node that never ran isn't "completed"
  • Edge conditions on outgoing edges can inspect execution_status == SKIPPED vs COMPLETED and route differently (powerful for loop control)
  • Observability — dashboards/logs can distinguish "ran successfully" from "intentionally bypassed"
  • No ambiguity with nodes that ran but legitimately returned empty output

2. CANCELLED — kill this branch gracefully (align with TS SDK)

This is what the TypeScript SDK does with beforeEvent.cancel: the node is cancelled, marked as a terminus, and downstream nodes do NOT execute. No error propagates — the graph can still complete via other branches.

This could be a separate hook field (e.g., cancel_node for cancel, skip_node for skip) or a single field with enum values ("skip" vs "cancel").

Why both are needed (especially for looping)

Consider a loop: A → B → C → (edge back to A if condition)

  • SKIPPED: On iteration 2, B is skipped because it already gathered data. C still runs, the back-edge fires, and the loop continues. This is the common case for conditional execution within loops.
  • CANCELLED: You want to abort a branch entirely — e.g., an optional side-branch that's no longer relevant. The main loop path continues via other branches.

Without the distinction, you can't express "skip but continue" vs "stop this path" — both of which are legitimate control flow decisions.

Complementary mechanism: Edge conditions with invocation_state

PR #2305 (merged) added invocation_state to edge condition evaluation, which provides an alternative way to route around nodes at the edge level. The two mechanisms are complementary:

  • Edge conditions: Control whether to traverse a path (node never enters the execution batch)
  • skip_node / cancel_node hooks: Control whether to execute a node that's already been reached (node is in the batch but gets bypassed)

Both are needed because edge conditions are evaluated before batch formation, while hooks fire after.


Happy to collaborate on the implementation if this direction makes sense. The core of your PR is right — the change from "raise RuntimeError" to "return cleanly" is exactly the fix. The suggestion is to layer proper status semantics on top.

@yonib05 yonib05 added python Pull requests that update python code bug Something isn't working labels May 29, 2026
@Zelys-DFKH
Copy link
Copy Markdown
Contributor Author

Agreed on SKIPPED vs COMPLETED. Updated; your analysis of the header-leak risk in _build_node_input was spot-on, too.

result=None, Status.SKIPPED throughout. _build_node_input filters SKIPPED deps before building dependency_results, so when all upstream deps are skipped the downstream node gets just the original task with no orphaned "From step_a:" header. _from_dict reads the persisted NodeResult.status to restore Status.SKIPPED correctly on resume. Also added skipped_nodes to GraphResult and adjusted completed_nodes to exclude SKIPPED nodes, which closes the observability gap.

SKIPPED nodes live in completed_nodes rather than a separate set. That satisfies both _find_newly_ready_nodes and _compute_ready_nodes_for_resume without touching either function. reset_on_revisit works correctly for loop re-entry, too: SKIPPED nodes are in completed_nodes, so they get cleared and become eligible for execution again on the next iteration.

edge.should_traverse is still evaluated for outgoing edges of SKIPPED nodes, which preserves the division you described: edge conditions control whether to enter a path before batch formation, hooks control whether to execute after.

CANCELLED (abort-branch semantics) is a separate concern; I'll open an issue.

Replace Status.COMPLETED + RuntimeError with Status.SKIPPED + None for
nodes bypassed via cancel_node. A skipped node never ran and should not
be marked completed; downstream readiness checks treat SKIPPED the same
as COMPLETED by keeping skipped nodes in completed_nodes.

- Add Status.SKIPPED enum value
- NodeResult.result accepts None; to_dict/from_dict round-trip via
  {"type": "skipped"}; get_agent_results() returns [] for None
- _build_node_input filters SKIPPED deps before building
  dependency_results, so downstream nodes receive only the original
  task with no orphaned "From node_x:" header
- _from_dict restores Status.SKIPPED on node.execution_status via
  the persisted NodeResult.status rather than blindly setting COMPLETED
- GraphResult gains skipped_nodes counter; completed_nodes now counts
  only nodes that actually ran
- reset_on_revisit correctly clears SKIPPED nodes for loop re-entry
  (they are already in completed_nodes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Zelys-DFKH
Copy link
Copy Markdown
Contributor Author

Opened #2401 to track CANCELLED (abort-branch semantics). It also surfaces the cancel_nodeskip_node naming question — happy to make that rename in this PR if maintainers prefer Option A before it merges.

@Zelys-DFKH Zelys-DFKH changed the title fix(graph): treat cancel_node as control flow, not fatal error fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-hooks Features or requests that might be implementable via hooks area-multiagent Multi-agent related bug Something isn't working python Pull requests that update python code size/m

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume

3 participants